Understanding the system health score

The Overview dashboard provides a single place from which you can quickly determine the health of the system.

The system health score can help you spot where your most severe health issues are, based on the following core factors: Configuration, Capacity, Data Protection, and Performance. The area with the highest risk to your system health lowers the score until remedial actions are taken.

The Overview panel displays values for the following high-level health metrics: Configuration, Capacity, Data Protection, and Performance. It also displays an overall health score, which is the lowest of these metric scores. The health score is calculated every five minutes, and the overall value is always calculated from all metric values. If any health score category is stale or unknown, the overall health score is not updated. Instead, the previously calculated overall health score is displayed, and the menu item is grayed out to indicate that the value is stale.
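The relationship between the metric scores and the overall score can be sketched as follows. This is an illustration only, not Unisphere code: the function name `overall_health`, the use of `None` to represent a stale or unknown category, and the returned tuple are all assumptions made for the example.

```python
from typing import Dict, Optional, Tuple

CATEGORIES = ("Configuration", "Capacity", "Data Protection", "Performance")


def overall_health(
    scores: Dict[str, Optional[int]],
    previous: Optional[int],
) -> Tuple[Optional[int], bool]:
    """Return (overall score, is_stale) for one calculation cycle.

    A category value of None stands for a stale or unknown score
    (a representation assumed for this sketch).
    """
    values = [scores.get(name) for name in CATEGORIES]
    if any(value is None for value in values):
        # Stale or unknown category: keep the previous overall score
        # and flag it as stale instead of recalculating.
        return previous, True
    # Otherwise the overall score is the lowest category score.
    return min(values), False
```

For example, category scores of 100, 70, 95, and 90 give an overall score of 70. If the Capacity score then becomes unknown, the previous value of 70 is kept and flagged as stale.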

The Capacity health score is based on % Effective Used Capacity. Capacity levels are checked at the SRP level and SRP Emulation level (where a mixed SRP emulation is involved).

The capacity health scores are calculated as follows:

  • Fatal level - based on what is defined in the System Threshold and Alerts dialog. The default fatal threshold is 100%. Deduction: 30 points.
  • Critical level - based on what is defined in the System Threshold and Alerts dialog. The default critical threshold is 80%. Deduction: 20 points.
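As a rough illustration of the two levels above, the following sketch maps a % Effective Used Capacity value to a deduction. The function name, the inclusive comparison (`>=`), and the rule that only the most severe level applies are assumptions. The sketch does not model the CloudIQ prediction deduction.

```python
def capacity_deduction(
    effective_used_pct: float,
    fatal_threshold: float = 100.0,    # default fatal threshold from the text
    critical_threshold: float = 80.0,  # default critical threshold from the text
) -> int:
    """Return the points deducted for one SRP (sketch under stated assumptions)."""
    if effective_used_pct >= fatal_threshold:
        return 30  # fatal level
    if effective_used_pct >= critical_threshold:
        return 20  # critical level
    return 0       # below both thresholds: no deduction


print(capacity_deduction(85.0))  # critical level with default thresholds
```

Passing different threshold values mimics a change made in the System Threshold and Alerts dialog.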

The Capacity health score is also affected by the CloudIQ Capacity Prediction analytics algorithm, which uses historical capacity usage to forecast when the SRP will be full. CloudIQ sends a health score deduction to Unisphere for PowerMax if the SRP is predicted to be full within a quarter or less. The size of the deduction depends on how close the SRP is to reaching full capacity.

The Performance health score is calculated using the threshold limits of the following categories and metrics:

  • FE Director: % busy, queue depth utilization (queue depth utilization is not checked for EF directors)
  • FE port: % busy
  • BE port: % busy (not applicable for storage systems running PowerMaxOS 10 (6079))
  • BE Director (DA): % busy
  • SRDF port: % busy
  • SRDF Director: % busy
  • DX port: % busy (not applicable for storage systems running PowerMaxOS 10 (6079))
  • External Director: % busy (not applicable for storage systems running PowerMaxOS 10 (6079))
  • EDS Director: % busy (not applicable for storage systems running PowerMaxOS 10 (6079))
  • Cache Partition: %WP utilization (not applicable for storage systems running PowerMaxOS 10 (6079))
  • EM Director: %WP utilization (applicable for storage systems running PowerMaxOS 10 (6079))
  • System (Array): % Cache WP Utilization

For each instance and metric in a particular category, the configured threshold is looked up. If no threshold is set, the default threshold is used. The default thresholds are:

  • FE Port: % busy - Critical 70
  • FE Director: % busy - Critical 70; Queue Depth Utilization - Critical 75
  • BE Port: % busy - Critical 70
  • BE Director (DA): % busy - Critical 70
  • SRDF Port: % busy - Critical 70
  • SRDF Director: % busy - Critical 70
  • DX Port: % busy - Critical 70
  • External Director: % busy - Critical 70
  • EDS Director: % busy - Critical 70
  • Cache Partition: %WP utilization - Critical 75
  • EM Director: % busy - Critical 70
  • System (Array): % Cache WP Utilization - Critical 60
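The lookup described above (use the configured threshold if one is set, otherwise use the default) can be sketched as follows. The default values are copied from the list above. The names `DEFAULT_CRITICAL`, `critical_threshold`, and `breaches`, the dictionary layout, and the inclusive comparison are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, str]  # (category, metric)

# Default critical thresholds, as listed in this topic.
DEFAULT_CRITICAL: Dict[Key, float] = {
    ("FE Port", "% busy"): 70,
    ("FE Director", "% busy"): 70,
    ("FE Director", "Queue Depth Utilization"): 75,
    ("BE Port", "% busy"): 70,
    ("BE Director (DA)", "% busy"): 70,
    ("SRDF Port", "% busy"): 70,
    ("SRDF Director", "% busy"): 70,
    ("DX Port", "% busy"): 70,
    ("External Director", "% busy"): 70,
    ("EDS Director", "% busy"): 70,
    ("Cache Partition", "%WP utilization"): 75,
    ("EM Director", "% busy"): 70,
    ("System (Array)", "% Cache WP Utilization"): 60,
}


def critical_threshold(category: str, metric: str,
                       custom: Optional[Dict[Key, float]] = None) -> float:
    """Return the configured threshold if set, otherwise the default."""
    key = (category, metric)
    if custom and key in custom:
        return custom[key]
    return DEFAULT_CRITICAL[key]


def breaches(category: str, metric: str, value: float,
             custom: Optional[Dict[Key, float]] = None) -> bool:
    """True if the value reaches the critical threshold (>= is an assumption)."""
    return value >= critical_threshold(category, metric, custom)
```

For instance, `breaches("System (Array)", "% Cache WP Utilization", 65)` is true with the defaults, while an FE port at 72% busy stops breaching if a custom threshold of 85 is supplied for that metric.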

The Performance health score also incorporates the following:

  • Storage Group (SG)
    • Read Response Time: Critical - five points
    • Write Response Time: Critical - five points
    • Response Time: Critical - five points
  • Service Level compliance
    • Underperforming - five points

For each storage group instance and metric in a particular category, the configured threshold is looked up. If no threshold is found, the default thresholds are used.

The Response Time, Read Response Time, and Write Response Time metrics are excluded from the health score calculation if the alert is not enabled for the SG/metric.

The Service Level Compliance health score is based on the Workload Planner (WLP) workload state. The health score is reduced when storage groups that have a defined service level are not meeting the service level requirements.

If the Service Level Compliance alert is disabled for the SG, it is ignored when performing the health score calculation.

The Performance score is calculated as follows:

  • Critical level - five points
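The following sketch shows one way the five-point critical deduction and the disabled-alert rule could combine. It is not the product's implementation: the starting score of 100, the additive deductions, the floor at zero, and the `Finding` type are all assumptions that this topic does not confirm.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Finding:
    """One checked instance/metric pair (hypothetical type for this sketch)."""
    name: str
    critical: bool              # True if the critical threshold is breached
    alert_enabled: bool = True  # False if the alert is disabled for the SG/metric


def performance_score(findings: Iterable[Finding],
                      base: int = 100,          # assumed starting score
                      per_critical: int = 5) -> int:
    """Deduct five points per critical finding whose alert is enabled."""
    deduction = sum(
        per_critical
        for finding in findings
        if finding.critical and finding.alert_enabled  # disabled alerts are ignored
    )
    return max(base - deduction, 0)  # the floor at zero is an assumption
```

With one critical storage group metric whose alert is enabled and another whose alert is disabled, only the first contributes, so the sketch returns 95.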

The Data Protection health score is based on whether any SRDF groups are in an Offline state or a Transmit Idle state. Each of these scenarios results in a 5-point deduction from the Data Protection health score.

SRDF groups associated with local SRDF ports that are offline cause a 2-point deduction.

The Configuration health score is calculated every five minutes and is based on the director and port alerts in the system at the time of calculation. Unisphere does not support alert correlation or auto clearing. You must manually delete alerts that have been dealt with or are no longer relevant, because these alerts continue to affect the hardware health score until they are removed from Unisphere.

The Configuration health score is calculated as follows:

  • Director out of service - 40 points
  • Director Offline - 20 points
  • Port Offline - 10 points
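To illustrate why alerts that have been dealt with must be deleted, the sketch below scores a set of active alerts using the deductions listed above. The starting score of 100, the additive combination, the floor at zero, and the names used are assumptions for illustration only.

```python
from typing import Iterable

# Deductions per alert type, as listed in this topic.
CONFIG_DEDUCTIONS = {
    "Director out of service": 40,
    "Director Offline": 20,
    "Port Offline": 10,
}


def configuration_score(active_alerts: Iterable[str], base: int = 100) -> int:
    """Subtract the listed deduction for every alert still present."""
    deduction = sum(CONFIG_DEDUCTIONS.get(alert, 0) for alert in active_alerts)
    return max(base - deduction, 0)  # the floor at zero is an assumption


alerts = ["Director Offline", "Port Offline"]
before = configuration_score(alerts)   # both alerts still count
alerts.remove("Port Offline")          # alert deleted manually once resolved
after = configuration_score(alerts)    # score recovers at the next calculation
```

In this sketch the score recovers only after the resolved alert is removed from the list, which mirrors the manual deletion described above.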

For embedded Unisphere systems that are running PowerMaxOS 10 (6079) and that support PowerMax File, the Configuration health score calculation is also affected by the following PowerMax File alerts:

  • Recovery failed on file system 'fsname' in NAS server 'nasserver' (fsid 'fsid') - 30 points.
  • File system 'fsname' in NAS server 'nasserver' (fsid 'fsid') is offline after discovering corruption - 30 points.
  • File system 'fsname' in NAS server 'nasserver' (fsid 'fsid') is offline due to receiving an I/O error - 30 points.
  • Recovery failed on the 'fsname' file system in NAS server 'nasserver' (fsid 'fsid') - 30 points.
  • The 'fsname' file system in NAS server 'nasserver' (fsid 'fsid') is offline after discovering corruption - 30 points.
  • The 'fsname' file system in NAS server 'nasserver' (fsid 'fsid') is offline due to receiving an I/O error - 30 points.
  • NAS node 'node' is down - 20 points.
  • NAS node 'node' is down and its automatic recovery has failed - 20 points.
  • NAS server 'nasserver' is down - 10 points.
  • NAS server 'nasserver' fault tolerance is degraded - 10 points.
  • NAS server 'nasserver' is in maintenance mode - 10 points.
  • The DNS client of the 'nasserver' NAS server is unable to connect to all configured DNS servers - 20 points.
  • No LDAP servers that are configured for NAS server 'nasserver' are responding - 20 points.
  • The LDAP service configuration of the NAS server 'nasserver' for domain 'domain' failed - 20 points.
  • The NIS client is unable to connect to all configured NIS servers - 20 points.
  • The NAS server 'nasserver' in the domain 'domain' cannot reach any domain controller - 20 points.
  • No virus checker server is available - 5 points.
  • LDAP client settings on NAS server 'nasserver' are not valid within domain 'domain' - 5 points.
  • The SMB server of the NAS server 'nasserver' is configured to be joined to the domain 'domain', but is currently not joined - 5 points.

For embedded Unisphere systems that are running PowerMaxOS 10 (6079) and that support PowerMax File, auto clearing of PowerMax File alerts is supported.

Auto clearing is a mechanism whereby an incoming alert of severity Normal clears any related alerts of higher severity that are already present in the system.
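A minimal sketch of this mechanism follows. How alerts are related to one another is not specified here, so the sketch assumes they share a `source` string. The severity ranking shown and the choice not to store the Normal alert itself are also assumptions.

```python
from dataclasses import dataclass
from typing import List

# Assumed severity ordering for this sketch; Normal is the lowest.
SEVERITY_RANK = {"Normal": 0, "Warning": 1, "Critical": 2, "Fatal": 3}


@dataclass(frozen=True)
class Alert:
    source: str    # hypothetical key that relates alerts to each other
    severity: str


def receive(active: List[Alert], incoming: Alert) -> List[Alert]:
    """Return the active alert list after an incoming alert is processed."""
    if incoming.severity == "Normal":
        # A Normal alert clears related alerts of higher severity.
        return [
            alert for alert in active
            if not (alert.source == incoming.source
                    and SEVERITY_RANK[alert.severity] > SEVERITY_RANK["Normal"])
        ]
    # Any other alert is added to the active list.
    return active + [incoming]
```

For example, if a Critical alert for a NAS server is active and a Normal alert for the same NAS server arrives, the Critical alert is cleared while alerts from other sources remain.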